
Support persisting sandbox metadata to a database #730

Open
zhangjaycee wants to merge 10 commits into alibaba:master from zhangjaycee:feature/db

Conversation

@zhangjaycee
Collaborator

close #729

Add DatabaseConfig dataclass (url field) to rock/config.py and wire it
into RockConfig both as a field and in the from_env() YAML parser.
- Add Base(DeclarativeBase) as the single SQLAlchemy declarative base
- Add SandboxRecord ORM model with all sandbox metadata columns
- Add LIST_BY_ALLOWLIST and _NOT_NULL_DEFAULTS class-level constants
- Add DatabaseProvider with async engine/session factory
- Add DatabaseConfig dataclass to RockConfig
- _convert_url handles sqlite://, postgresql://, and postgres:// (Heroku)
  shorthand; URLs with existing driver specifier pass through unchanged
- Default state column value uses string literal "pending" instead of
  State.PENDING enum instance for explicit column semantics
- Add SandboxTable with insert/get/update/delete/list_by/list_by_in
- _filter_data strips unknown keys; _NOT_NULL_DEFAULTS fills NOT NULL cols
- LIST_BY_ALLOWLIST prevents arbitrary column queries (injection guard)
- _record_to_sandbox_info uses lru_cache to avoid repeated get_type_hints
  calls in bulk list_by scenarios
- Add SandboxInfoField generated type and generation script
- Redis alive/timeout keys remain the source of truth for live state
- DB writes are fire-and-forget via asyncio.create_task + _safe_db_call
- batch_get: Redis hits served directly; DB fallback uses a single
  list_by_in("sandbox_id", miss_ids) query instead of N serial gets,
  leveraging the primary key index for O(1) lookup per row
- iter_alive_sandbox_ids queries DB by state IN (running, pending)
  instead of Redis scan_iter, enabling indexed filtering
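
For reviewers, the `_convert_url` behaviour listed above can be sketched roughly as follows. This is not the code in this PR: the function name `convert_url`, the choice of `aiosqlite` and `asyncpg` as the async drivers, and the pass-through for unrecognised schemes are all assumptions made for illustration.

```python
# Hypothetical sketch of the URL conversion described in the commit message.
# Assumed (not confirmed by the PR): aiosqlite / asyncpg as the async drivers.
_ASYNC_SCHEMES = {
    "sqlite": "sqlite+aiosqlite",
    "postgresql": "postgresql+asyncpg",
    "postgres": "postgresql+asyncpg",  # Heroku-style shorthand
}


def convert_url(url: str) -> str:
    """Rewrite a plain database URL so that it names an async driver."""
    scheme, sep, rest = url.partition("://")
    if not sep:
        raise ValueError(f"not a database URL: {url!r}")
    if "+" in scheme:
        # A driver is already specified (e.g. postgresql+psycopg): leave as-is.
        return url
    # Unknown schemes pass through unchanged (an assumption of this sketch).
    return f"{_ASYNC_SCHEMES.get(scheme, scheme)}://{rest}"


converted_sqlite = convert_url("sqlite:///rock.db")
converted_heroku = convert_url("postgres://user:pw@host/db")
converted_explicit = convert_url("postgresql+psycopg://host/db")
```

Keeping the mapping in one table makes it easy to see which shorthand forms are accepted and to add another backend later.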
…e to meta_repo

- Replace MetaStore with SandboxRepository throughout SandboxManager,
  GemManager, BaseManager, and SandboxProxyService
- Wire SandboxRepository (Redis + SandboxTable) in admin/main.py startup
- stop(): add early return after archive() in the ValueError except branch
  to prevent double archive when the Ray actor is already gone

Made-with: Cursor
- Add TestSandboxTableWithSQLite: full CRUD coverage using SQLite
  in-memory database (no external dependencies, runs in fast CI)
  including list_by_in, NOT NULL defaults, and noop-on-missing-id cases
- Add TestSandboxTableWithPostgres: PostgreSQL-specific tests (JSONB,
  real container) marked need_docker + need_database
- Add comprehensive SandboxRepository tests: create/update/delete/archive/
  get/exists/batch_get/list_by/refresh_timeout/is_expired
- Consistent lowercase "stopped" state string throughout test data,
  matching the State enum value convention (running/pending)
- Add single-column indexes on all commonly queried fields (user_id,
  state, namespace, experiment_id, cluster_name, image, host_ip,
  host_name, create_user_gray_flag)
- Add scripts/gen_ddl.py to emit CREATE TABLE / CREATE INDEX DDL
- Add *.db and ddl/ to .gitignore (generated artifacts)

OperatorContext was missing redis_provider, leaving RayOperator._redis_provider
as None. This caused the use_rocklet get_status path to crash with
"'NoneType' object has no attribute 'get'" because build_sandbox_from_redis
skips the lookup entirely when redis_provider is None.
- Add update_version field to SandboxInfo/SandboxRecord and lock_sandbox_key() helper
- Add LockResult enum and lock operations to RedisProvider (create_and_acquire, acquire, optimistic_update, release)
- Add version-guarded update() to SandboxTable to skip stale writes
- Implement SandboxRepository lock context managers (create_and_acquire_lock, acquire_lock) and version-aware CRUD
- Wrap SandboxManager.start_async/stop with pessimistic lock context managers; remove _check_sandbox_exists_in_redis
- Add tests for optimistic update behaviour in SandboxRepository
- Add update_version to SandboxInfoField Literal type
- Document that create() does not write the lock key and update() no-ops
  without it; callers must use create_and_acquire_lock + create(..., version=).
- Add create_with_lock helper and use it in TestSandboxRepositoryWithDocker.
- Drop unused socket import in test_sandbox_repository.py.
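
The version-guarded `update()` mentioned above could look roughly like the sketch below. It uses the standard-library `sqlite3` module rather than the PR's async SQLAlchemy session so that it runs on its own; the table layout, the helper name `guarded_update`, and the "strictly newer version wins" rule are assumptions, not the PR's actual implementation.

```python
import sqlite3


def guarded_update(conn: sqlite3.Connection, sandbox_id: str, state: str, version: int) -> bool:
    """Apply the write only if `version` is newer than the stored update_version."""
    cur = conn.execute(
        "UPDATE sandbox SET state = ?, update_version = ? "
        "WHERE sandbox_id = ? AND update_version < ?",
        (state, version, sandbox_id, version),
    )
    conn.commit()
    # rowcount is 0 when the row is missing or the incoming write is stale.
    return cur.rowcount == 1


# Hypothetical minimal table with only the columns needed for the example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sandbox ("
    "sandbox_id TEXT PRIMARY KEY, state TEXT NOT NULL, update_version INTEGER NOT NULL)"
)
conn.execute("INSERT INTO sandbox VALUES ('sb-1', 'pending', 1)")

applied = guarded_update(conn, "sb-1", "running", 2)  # newer version: written
stale = guarded_update(conn, "sb-1", "pending", 1)    # older version: skipped
final_state = conn.execute(
    "SELECT state FROM sandbox WHERE sandbox_id = 'sb-1'"
).fetchone()[0]
```

Putting the version check in the `WHERE` clause makes the comparison and the write a single statement, so a late-arriving write cannot overwrite a newer row.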
self,
sandbox_id: str,
sandbox_info: SandboxInfo,
timeout_info: dict[str, str] | None = None,
Collaborator


Move all of the timeout calculation out of this method and extract it into a separate class.

_ACTIVE_STATES: list[str] = [State.RUNNING, State.PENDING]


class SandboxRepository:
Collaborator


rename MetaStore

else:
info_with_version = sandbox_info

self._fire_db_insert(sandbox_id, {**info_with_version, "update_version": version})
Collaborator

@StephenRi StephenRi Apr 2, 2026


Make all of these calls synchronous.


Development

Successfully merging this pull request may close these issues.

[Feature] Support persist sandbox info to databases

2 participants